Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2017/04/03 13:08:41 UTC

[jira] [Commented] (MAPREDUCE-6874) Make DistributedCache check if the content of a directory has changed

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953441#comment-15953441 ] 

Jason Lowe commented on MAPREDUCE-6874:
---------------------------------------

This is a known limitation of the distributed cache.  A full-depth traversal of a directory tree can be very expensive, and the API only supports one timestamp per distributed cache entry.  Not only is it expensive to stat the entire tree to see whether it has changed, it is also expensive to localize the files: there is RPC overhead for each file in the tree.
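A JDK-only sketch of why the single directory timestamp is a poor change signal (no Hadoop APIs are used; the `components/subsub/subsub.sh` layout is borrowed from the report below, and local-filesystem semantics stand in for HDFS here): replacing a file deep in the tree does not move the modification time of an ancestor directory, so a single stat of the top-level directory sees nothing new.

```java
import java.nio.file.*;
import java.nio.file.attribute.FileTime;

public class DirMtimeDemo {
    // Builds components/subsub/subsub.sh, then "replaces" the nested
    // script (delete + recreate, as happens when a file is re-uploaded)
    // and checks whether the top-level directory's mtime moved.
    public static boolean topDirMtimeUnchanged() throws Exception {
        Path components = Files.createTempDirectory("components");
        Path subsub = Files.createDirectories(components.resolve("subsub"));
        Path script = Files.write(subsub.resolve("subsub.sh"),
                                  "echo v1\n".getBytes());

        FileTime before = Files.getLastModifiedTime(components);
        Thread.sleep(1100); // exceed 1-second mtime granularity on coarse filesystems

        // Replace the nested file; only subsub/ (its direct parent) is touched.
        Files.delete(script);
        Files.write(subsub.resolve("subsub.sh"), "echo v2\n".getBytes());

        FileTime after = Files.getLastModifiedTime(components);
        return before.equals(after); // ancestor timestamp did not change
    }

    public static void main(String[] args) throws Exception {
        System.out.println("components/ mtime unchanged: " + topDirMtimeUnchanged());
    }
}
```

Only entry creation, deletion, or rename in a directory itself updates that directory's modification time, which is exactly why one timestamp per cache entry cannot detect nested edits.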

It is much more efficient, and safer, to use an archive (e.g.: .tar.gz, .zip, etc.) instead of a directory.  Then there is only one timestamp to check to know whether anything in the "tree" has changed.  Arguably directory trees shouldn't be supported in the distributed cache at all, but I believe they were added way back when to support use cases where a chain of MapReduce jobs needed the output of a previous job (i.e.: a directory) to be used as a cache file for the next job (e.g.: a map-side join).
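A JDK-only sketch of the archive approach (using java.util.zip rather than Hadoop's archive handling, with hypothetical file names): once the tree is shipped as a single archive, any rebuild after an edit advances the one timestamp the cache needs to compare, so a single stat detects the change.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ArchiveTimestampDemo {
    // Zips a directory tree into one file; that file's mtime is the
    // single timestamp a cache has to check.
    static void zipTree(Path root, Path zipFile) throws IOException {
        try (Stream<Path> walk = Files.walk(root);
             ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            List<Path> regular = walk.filter(Files::isRegularFile)
                                     .collect(Collectors.toList());
            for (Path p : regular) {
                zos.putNextEntry(new ZipEntry(root.relativize(p).toString()));
                zos.write(Files.readAllBytes(p));
                zos.closeEntry();
            }
        }
    }

    public static boolean archiveMtimeMoves() throws Exception {
        Path root = Files.createTempDirectory("components");
        Path nested = Files.createDirectories(root.resolve("subsub"));
        Files.write(nested.resolve("subsub.sh"), "echo v1\n".getBytes());

        Path zip = root.getParent().resolve(root.getFileName() + ".zip");
        zipTree(root, zip);
        FileTime before = Files.getLastModifiedTime(zip);

        Thread.sleep(1100); // exceed 1-second mtime granularity
        Files.write(nested.resolve("subsub.sh"), "echo v2\n".getBytes());
        zipTree(root, zip); // rebuild the archive after the nested edit

        return Files.getLastModifiedTime(zip).compareTo(before) > 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("archive mtime advanced: " + archiveMtimeMoves());
    }
}
```

In a real job the rebuilt archive would be uploaded and registered as a cache archive (in current MapReduce APIs, via Job's addCacheArchive) so that localization also happens as a single transfer rather than one RPC per file.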

> Make DistributedCache check if the content of a directory has changed
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6874
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> DistributedCache does not check recursively whether the content of a directory has changed when adding files to it with {{DistributedCache.addCacheFile()}}.
> h5. Background
> I have an Oozie workflow on HDFS:
> {code}
> example_workflow
> ├── job.properties
> ├── lib
> │   ├── components
> │   │   ├── sub-component.sh
> │   │   └── subsub
> │   │       └── subsub.sh
> │   ├── main.sh
> │   └── sub.sh
> └── workflow.xml
> {code}
> I executed the workflow, then made some changes to {{subsub.sh}} and replaced the file on HDFS. When I re-ran the workflow, DistributedCache did not notice the change because the timestamp on the {{components}} directory had not changed. As a result, the old script was materialized.
> This behaviour might be related to [determineTimestamps() |https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/filecache/ClientDistributedCacheManager.java#L84].
> In order to use the new script during workflow execution, I had to update the whole {{components}} directory.
> h6. Some more info:
> In Oozie, [DistributedCache.addCacheFile() |https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L625] is used to add files to the distributed cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org