Posted to mapreduce-issues@hadoop.apache.org by "Mohammad Kamrul Islam (JIRA)" <ji...@apache.org> on 2012/08/20 21:35:38 UTC

[jira] [Created] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Mohammad Kamrul Islam created MAPREDUCE-4568:
------------------------------------------------

             Summary: Throw "early" exception when duplicate files or archives are found in distributed cache
                 Key: MAPREDUCE-4568
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4568
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Mohammad Kamrul Islam


According to #MAPREDUCE-4549, Hadoop 2.x throws an exception if duplicates are found in cacheFiles or cacheArchives. The exception is thrown during job submission.

This JIRA is to throw the exception "early", when the entry is first added to the Distributed Cache through addCacheFile or addFileToClassPath.

It will help the client decide whether to fail fast or continue without the duplicated entries.

Alternatively, Hadoop could provide a knob that lets the user choose whether to throw an error (the upcoming behavior) or silently ignore duplicates (the old behavior).
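
To make the "knob" idea concrete, here is a minimal client-side sketch. Everything in it is hypothetical and not part of Hadoop: the DuplicatePolicy enum, the CacheFileCollector class, and the assumption that two entries count as duplicates when they share a link name (the URI fragment if present, otherwise the last path segment).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical knob: what to do when a link name is already taken. */
enum DuplicatePolicy { FAIL, IGNORE }

/** Hypothetical client-side collector; not part of Hadoop. */
class CacheFileCollector {
    private final DuplicatePolicy policy;
    // link name -> URI that claimed it first, kept in insertion order
    private final Map<String, URI> byLinkName = new LinkedHashMap<>();

    CacheFileCollector(DuplicatePolicy policy) {
        this.policy = policy;
    }

    /** Returns true if the URI was added, false if it was skipped as a duplicate. */
    boolean add(URI uri) {
        String name = linkName(uri);
        if (byLinkName.containsKey(name)) {
            if (policy == DuplicatePolicy.FAIL) {
                // Fail fast at add time instead of at job submission.
                throw new IllegalArgumentException(
                    "Duplicate cache entry '" + name + "': " + uri
                    + " conflicts with " + byLinkName.get(name));
            }
            return false; // IGNORE: keep the first entry, drop this one
        }
        byLinkName.put(name, uri);
        return true;
    }

    /** Returns the accepted entries in the order they were added. */
    List<URI> entries() {
        return new ArrayList<>(byLinkName.values());
    }

    // Assumption: link name = URI fragment if present, else the last path segment.
    // Only hierarchical URIs (with a path) are handled here.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
```

A client such as Oozie could build its list with DuplicatePolicy.IGNORE and then pass entries() to whichever add-to-cache call it already uses, while a stricter client could pick DuplicatePolicy.FAIL to surface the problem before submission.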



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472526#comment-13472526 ] 

Robert Joseph Evans commented on MAPREDUCE-4568:
------------------------------------------------

I spoke with Virag about this before he filed the JIRA.  The main goal here is to provide a way for Oozie to maintain more of a semblance of backwards compatibility even after MAPREDUCE-4549 goes in.  They essentially want to de-dupe the entries in the dist cache that would cause an error.  We originally decided on having an exception thrown because it would allow for other errors/checks that may show up in the future to also be added in.  I don't think there would be a problem with adding a new API that throws an exception, provided that API was also added to the 1.x line, where it perhaps would not throw anything because the same limitations do not exist there.

I realize that adding in new APIs, especially since we already have 3 classes that have these types of APIs in them, is not ideal, but it is the only way to maintain backwards compatibility and evolve the API.

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Mohammad Kamrul Islam (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472981#comment-13472981 ] 

Mohammad Kamrul Islam commented on MAPREDUCE-4568:
--------------------------------------------------

In addition, it will be better, if there is a way of checking whether some file is already added in DC.

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Virag Kothari (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459236#comment-13459236 ] 

Virag Kothari commented on MAPREDUCE-4568:
------------------------------------------

Is there any update on this? 
Thanks!

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475993#comment-13475993 ] 

Arun C Murthy commented on MAPREDUCE-4568:
------------------------------------------

Clients can already query contents of DC...

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472173#comment-13472173 ] 

Arun C Murthy commented on MAPREDUCE-4568:
------------------------------------------

Ok, I spent an inordinate amount of time on this, but I am finally ready to give up.

Unfortunately none of the DistributedCache APIs (i.e. DistributedCache.addCache(File|Archive) or Job.addCache(File|Archive)) have an exception specification - this means we'll need to resort to throwing RuntimeException or such, which I'm not a fan of...

For now, I feel the best we can do (without breaking compat) is to just document this and leave it as it is... 

Thoughts?
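
The compatibility constraint around exception specifications can be illustrated with a small sketch. It is not Hadoop code: CacheFileList is a stand-in for the real Job/DistributedCache classes, and DuplicateCacheEntryException and addCacheFileChecked are hypothetical names. For brevity it treats two equal URIs as a duplicate. The point is only that a checked exception cannot be added to an existing method without breaking callers, whereas a new method can declare one from the start.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical checked exception; Hadoop defines no such class. */
class DuplicateCacheEntryException extends Exception {
    DuplicateCacheEntryException(String message) {
        super(message);
    }
}

/** Stand-in for a job's cache-file list; not the real Job or DistributedCache class. */
class CacheFileList {
    private final Set<URI> files = new LinkedHashSet<>();

    // Existing-style method: no throws clause, so existing callers compile unchanged.
    // Adding "throws DuplicateCacheEntryException" here would break every caller
    // that neither catches nor declares it. Only an unchecked exception could be
    // thrown from this signature.
    void addCacheFile(URI uri) {
        files.add(uri); // a repeated URI is silently absorbed by the set
    }

    // Hypothetical new method: the checked exception is part of its signature
    // from the start, so no existing caller is affected.
    void addCacheFileChecked(URI uri) throws DuplicateCacheEntryException {
        if (!files.add(uri)) {
            throw new DuplicateCacheEntryException("Already in the cache: " + uri);
        }
    }

    int size() {
        return files.size();
    }
}
```

The trade-off is the one raised in the thread: the new method leaves old callers untouched but adds one more way to put things into the cache.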

[jira] [Assigned] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-4568:
----------------------------------------

    Assignee: Arun C Murthy

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473271#comment-13473271 ] 

Robert Joseph Evans commented on MAPREDUCE-4568:
------------------------------------------------

Adding a true duplicate, i.e. the exact same file multiple times, to the dist cache will not result in an error under YARN.  The MR client will just dedupe them before submitting the request to YARN.  The issue is when there are different files that both map to the same key in the dist cache map (the key is the name of the symlink created in the working directory of the task/container).  That is where it will throw an exception under 2.0.
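
The difference between a true duplicate and a symlink-name collision can be sketched as follows. This is an illustration only, not the MR client's code: LinkNameCollisions is a hypothetical helper, and it assumes the link name is the URI fragment if one is present and the last path segment otherwise.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; it only mimics the behavior described in the thread. */
class LinkNameCollisions {

    // Assumption: link name = URI fragment if present, otherwise the last path segment.
    // Only hierarchical URIs (with a path) are handled here.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Returns the link names claimed by more than one distinct URI. */
    static List<String> find(List<URI> entries) {
        Map<String, Set<URI>> byName = new LinkedHashMap<>();
        for (URI uri : entries) {
            // The Set absorbs exact repeats of the same URI (true duplicates).
            byName.computeIfAbsent(linkName(uri), k -> new LinkedHashSet<>()).add(uri);
        }
        List<String> collisions = new ArrayList<>();
        for (Map.Entry<String, Set<URI>> e : byName.entrySet()) {
            if (e.getValue().size() > 1) {
                collisions.add(e.getKey()); // different files, same symlink name
            }
        }
        return collisions;
    }
}
```

With this helper, adding hdfs://nn/libs/a.jar twice reports nothing, while adding hdfs://nn/libs/a.jar and hdfs://nn/other/a.jar reports a collision on a.jar.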

[jira] [Commented] (MAPREDUCE-4568) Throw "early" exception when duplicate files or archives are found in distributed cache

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473277#comment-13473277 ] 

Jason Lowe commented on MAPREDUCE-4568:
---------------------------------------

> In addition, it will be better, if there is a way of checking whether some file is already added in DC.

Would adding an interface so the client can query the contents of the DC before job submission be sufficient?  This seems like a reasonable enhancement that doesn't overlap with existing interfaces.  Or do you think it's still a requirement to throw early when adding a collision?  Throwing will require adding a new interface for adding to the DC which overlaps with existing functionality and adds to the pile of APIs we already have for adding things to the DC.