Posted to common-dev@hadoop.apache.org by "Philip Zeyliger (JIRA)" <ji...@apache.org> on 2009/06/01 07:38:07 UTC

[jira] Updated: (HADOOP-2914) extend DistributedCache to work locally (LocalJobRunner)

     [ https://issues.apache.org/jira/browse/HADOOP-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Zeyliger updated HADOOP-2914:
------------------------------------

    Attachment: HADOOP-2914-v1-full.patch
                HADOOP-2914-v1-since-4041.patch

I set out to get DistributedCache to work on local job runner --- which wasn't too tricky --- but I ended up refactoring the DistributedCache code quite a bit, which has made this patch large and perhaps unfriendly.

DistributedCache code is used in three places:
# In user code, to (1) configure files to be cached and (2) retrieve the URIs of those files at runtime,
# In JobClient, to record some metadata information about the files desired in user code,
# And in TaskTracker/TaskRunner, to (1) maintain the cache, and (2) configure the cache per task.

Most of the code for all of these uses was in public static methods in DistributedCache.java, though some pretty complicated logic about the DistributedCache was also in TaskTracker.java and TaskRunner.java.  This made it tricky to tease out what the sacrosanct public APIs were.  My interpretation is that the methods described in the documentation (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache) are public APIs, and I have left those, and a few others, intact.  I separated out the other logic into two other classes, so that I could avoid duplication between TaskRunner and LocalJobRunner.

The current patch depends on HADOOP-4041, so I've attached two patches: one for Hudson, and another if you don't want to revisit the intersection with 4041 (which is largely uninteresting: either way code moves out of TaskRunner into DistributedCacheHandle).

I've added some tests.  TestDistributedCache has become TestDistributedCacheManager, and there's a new test in there.  TestMRWithDistributedCache tests against both local and MiniMRClusters.  I've also tested using streaming, with commands like:
{noformat}
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -files /etc/passwd -input /dev/null -output /tmp/output1 -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -jt local -files /etc/passwd -input /dev/null -output /tmp/output2 -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
{noformat}
Is there a place where tests that use streaming to check other functionality could be checked in?

I wanted to stop somewhere and send this out, but I can think of several potential future JIRAs:

* The DistributedCache is in core/, but it only makes sense with mapred, so it probably should be relocated to mapred.
* There's more work to be done to separate out the public interfaces from the private ones.  The timestamp handling that's done by JobClient should really be done by something within the filecache package, for example.  Much of the annoyance here stems from the haphazard ways in which Hadoop jobs serialize some configuration data to the configuration file.  DistributedCache uses, I believe, 6 configuration keys just to store its state ("file", "archive", "file+classpath", "archive+classpath", "file+timestamp", "archive+timestamp").
* Speaking of configuration, DistributedCache will not likely work for files with a comma in their path, though perhaps URI encoding saves us there.
* I haven't touched the DistributedCacheManager code except to move it there, but I suspect it could be significantly simplified now that it contains a Configuration object.
* It's my belief that SVN r696957 (HADOOP-249) turned off the symlink feature and that it hasn't worked since then.  That said, I haven't yet written the test that would confirm this.
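
To make the comma concern above concrete, here is a minimal, self-contained sketch.  It does not use any Hadoop classes: it only assumes that the cached paths are stored as a single comma-joined configuration value and split on commas when read back.  The class name and the escape/unescape helpers are hypothetical, written only to show that some form of percent-escaping would let such paths survive the round trip; they are not what DistributedCache does today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class CommaInCachePath {
    // Assumed escaping scheme: protect '%' first, then ','.
    static String escape(String path) {
        return path.replace("%", "%25").replace(",", "%2C");
    }

    // Reverse of escape(): restore ',' first, then '%'.
    static String unescape(String stored) {
        return stored.replace("%2C", ",").replace("%25", "%");
    }

    // Store the paths as one comma-joined string, then split on commas.
    static List<String> naiveRoundTrip(List<String> paths) {
        String stored = String.join(",", paths);
        return Arrays.asList(stored.split(","));
    }

    // Same round trip, but each path is escaped before joining
    // and unescaped after splitting.
    static List<String> escapedRoundTrip(List<String> paths) {
        List<String> escaped = new ArrayList<>();
        for (String p : paths) {
            escaped.add(escape(p));
        }
        String stored = String.join(",", escaped);
        List<String> result = new ArrayList<>();
        for (String piece : stored.split(",")) {
            result.add(unescape(piece));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/a,b.txt", "/data/c.txt");
        // Three entries come back: the first path was cut in two.
        System.out.println(naiveRoundTrip(paths));
        // The two original paths come back unchanged.
        System.out.println(escapedRoundTrip(paths));
    }
}
```

Whether the existing URI handling already escapes commas this way is exactly the part that would need a test.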

Looking forward to your feedback. -- Philip


> extend DistributedCache to work locally (LocalJobRunner)
> --------------------------------------------------------
>
>                 Key: HADOOP-2914
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2914
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: sam rash
>            Priority: Minor
>         Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch
>
>
> The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html 
> Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file system (http; assume hdfs = default fs = local fs when doing local development).
