You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Allen Wittenauer (Created) (JIRA)" <ji...@apache.org> on 2012/02/06 23:50:59 UTC

[jira] [Created] (MAPREDUCE-3824) Distributed caches are not removed properly

Distributed caches are not removed properly
-------------------------------------------

                 Key: MAPREDUCE-3824
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: distributed-cache
    Affects Versions: 1.0.0
            Reporter: Allen Wittenauer
            Priority: Critical


Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-3824:
-------------------------------------

    Target Version/s: 1.0.1
              Status: Patch Available  (was: Open)
    
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned MAPREDUCE-3824:
----------------------------------------

    Assignee: Thomas Graves
    
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-3824:
-------------------------------------

    Attachment: MAPREDUCE-3824-branch-1.0.patch

patch changes:
 * refcount to be Atomic Integer 
 * private caches properly update sizes
 * add test for sizes
 
 
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208923#comment-13208923 ] 

Thomas Graves commented on MAPREDUCE-3824:
------------------------------------------

I did a bunch of manual testing.  I tested private and public version of files and directories and also private and public archives going into the distributed cache and verified that the sizes were set properly.  If there are any other cases I missed please let me know.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-3824:
----------------------------------------

    Attachment: MAPREDUCE-3824-branch-1.0.txt

Here's a quick hack that fixes both issues on my systems.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208050#comment-13208050 ] 

Thomas Graves commented on MAPREDUCE-3824:
------------------------------------------

After some debugging, it appears that the size isn't being calculated properly (set to 0) if the user specifies a directory to go into the distributed cache. It only appears to happen if its a private cached directory. I'm working on a patch.

Allen, can you confirm that your users were specifying a directory to be cached and not a file?
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Matt Foley (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Foley updated MAPREDUCE-3824:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.1
           Status: Resolved  (was: Patch Available)

Committed to branch-1.0 and branch-1.
Thanks, Thomas, Allen, and reviewers!
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>             Fix For: 1.0.1
>
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202622#comment-13202622 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

(well the tasks that the tasks launch, not the TT itself)
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209112#comment-13209112 ] 

Hadoop QA commented on MAPREDUCE-3824:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12514754/MAPREDUCE-3824-branch-1.0.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1878//console

This message is automatically generated.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Matt Foley (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Foley updated MAPREDUCE-3824:
----------------------------------

    Target Version/s: 1.0.2  (was: 1.0.1)
       Fix Version/s:     (was: 1.0.1)
                      1.0.2

Was committed to 1.0.2, not 1.0.1.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>             Fix For: 1.0.2
>
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201736#comment-13201736 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

This appears to be related to two problems:

* race conditions around refcount
* private caches are never updated to reflect their true size

                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209819#comment-13209819 ] 

Robert Joseph Evans commented on MAPREDUCE-3824:
------------------------------------------------

I like the patch. +1

I would prefer to see the debug log for "Error cleaning up cache" to be an actual error log.  But that is minor as it is debug now, and the patch just changes the information logged.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208927#comment-13208927 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

It would be good to add some debug statements. When stuff does break (related or not), ops teams end up being completely blind.  (This is a problem with many many many other parts of Hadoop!)
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202621#comment-13202621 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

PS, if the tasktracker can survive disk failure in 1.0, it must not work under these conditions.  So many failures...
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202612#comment-13202612 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

There is no doubt the patch is a hack, but it solved my immediate problems because as it stands, distributed caches are really broken at scale.

Some background.  I have a team of users that have several 36GB distributed caches. When these caches are in play, most of the system is basically locked while these caches get built.  This patch was really geared towards making sure that these massive caches at least get deleted.  Without these patches in place, the mapred tmp spaces fill and tasks fail, eventually leading to mapred framework collapse. 

There are a lot of other problems that show up with caches this large:
* Hadoop doesn't have a size limit check on caches as part of the job submission process [So any hand waving about "don't use caches that big!" are null and void since there is no way to actually stop a user from doing that!]
* the setup and cleanup tasks also trigger cache downloads.
* tasktrackers appear to be frozen for *all* tasks during cache downloads, with the task stuck in the extremely unhelpful "unassigned" state.
* the methodology of updating the private cache as a different step seems unnecessary given the permissions at the file system level.

What really needs to happen is a massive overhaul of the entire distributed cache system.  But that's a bigger project, preferably for someone who gets paid to do hadoop development full time.  So, like all of the patches I've been submitting lately, I'm not expecting them to get committed. But this is enough of a patch for someone who needs a useable system until a working release ships.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202647#comment-13202647 ] 

Robert Joseph Evans commented on MAPREDUCE-3824:
------------------------------------------------

I agree with you the the 1.0 distributed cache code is a mess.  I don't really understand all of what it is doing, especially with how it interacts with other parts of TT.  I agree a rewrite would be good. And I think that is what YARN is for.  I would be happy to commit your patch for you.  At a minimum the volatile change needs to go in, and the rest I just want to feel comfortable that it will not cause more harm then good.  Also if you want to place limits on the size of files placed in the distributed cache, please file a JIRA for that, it would be a great addition.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208908#comment-13208908 ] 

Hadoop QA commented on MAPREDUCE-3824:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12514727/MAPREDUCE-3824-branch-1.0.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1871//console

This message is automatically generated.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202788#comment-13202788 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

Part of the reason why distribute cache is a mystery as to how it actually works in practice is due to the lack of active debugging.  Note that most of this patch is LOG.debug() statements. :(

                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209110#comment-13209110 ] 

Thomas Graves commented on MAPREDUCE-3824:
------------------------------------------

test-patch results. The findbugs warnings aren't from this patch.  They occurred without this patch also.

    [exec] 
     [exec] 
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     -1 findbugs.  The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Thomas Graves (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-3824:
-------------------------------------

    Attachment: MAPREDUCE-3824-branch-1.0.patch

added some debug statements.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.patch, MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202496#comment-13202496 ] 

Robert Joseph Evans commented on MAPREDUCE-3824:
------------------------------------------------

I like the concept of the patch.  Volatile is definitely needed here, my bad on that one.  I also like that you are doing a DU to update the size of the cached objects if they are 0.  I do have some issues with the patch though.

The first is that even though the DU size update is being done on a separate thread it is being done with the cachedArchives lock held.  The amount of time it takes to do a DU could be significant.  Nothing new can be added to the cache while the cachedArchives lock is held, so it could be blocking other new tasks from making progress.  I would really prefer to see this done in two passes, similar to how we delete out entries.  The first pass would go through all entries and identify those that need to be updated, the second pass would be to update those entries without the lock held.  Then once we have all of the entries updated we can look at cleaning up the distributed cache. 

The second is that we are updating the size too late.  We decide how much space needs to be deleted to get us back under the desired amount based totally on the size reported by BaseDirManager, which in turn gets its data from the CacheStatus object.  The issue is that in the current patch we first calculate how much needs to be removed, then we update the size of the archives, then we delete them.  This is not that critical, because it just means that in the next pass they would be deleted, so this is really very minor, but should be covered by doing the update in two passes.

I am not sure exactly what are the situations that the size is not being set.  I would like to know exactly which situations the current code is missing, because like I said previously the code that computes the used size goes completely off of what is reported to BaseDirManager, unfortunately there are some issues with BaseDirManger where if we are too aggressive with setting the size we might double count some archives, which eventually would make it so that the BaseDirManager thinks it is full all the time, which would be very bad.

                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly

Posted by "Allen Wittenauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208090#comment-13208090 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

Yes, that corresponds to what I was seeing as well.  Sorry, forgot to mention the directory thing.  I've been running the patch for so long and so successfully I forgot about that detail.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Thomas Graves
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira