Posted to common-user@hadoop.apache.org by Larry Compton <la...@gmail.com> on 2010/04/15 21:56:38 UTC

Distributed Cache with New API

I'm trying to use the distributed cache in a MapReduce job written to the
new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file path is
added to the distributed cache as follows:

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "Job");
        ...
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

The "setup()" method in my mapper tries to read the path as follows:

    protected void setup(Context context) throws IOException {
        Path[] paths = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
    }

But "paths" is null.

I'm assuming I'm setting up the distributed cache incorrectly. I've seen a
few hints in previous mailing list postings that indicate that the
distributed cache is accessed via the Job and JobContext objects in the
revised API, but the javadocs don't seem to support that.

Thanks.
Larry

Re: Distributed Cache with New API

Posted by Larry Compton <la...@gmail.com>.
Thanks. That clears it up.

Larry

On Fri, Apr 16, 2010 at 1:05 AM, Amareshwari Sri Ramadasu <amarsri@yahoo-inc.com> wrote:

> Hi,
> @Ted, the code below is internal code. Users are not expected to call
> DistributedCache.getLocalCache(), nor can they, since they do not know
> all of the parameters.
> @Larry, DistributedCache was not changed to use the new API in branch 0.20.
> The change was made only from branch 0.21 onward. See MAPREDUCE-898 (
> https://issues.apache.org/jira/browse/MAPREDUCE-898).
> If you are using branch 0.20, you are encouraged to use the deprecated
> JobConf itself.
> You can try the following change in your code. Change the line
>        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
> to
>        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
>
> Thanks
> Amareshwari
>
> On 4/16/10 2:27 AM, "Ted Yu" <yu...@gmail.com> wrote:
>
> Please take a look at the loop starting at line 158 in TaskRunner.java:
>            p[i] = DistributedCache.getLocalCache(files[i], conf,
>                                                  new Path(baseDir),
>                                                  fileStatus,
>                                                  false,
>                                                  Long.parseLong(fileTimestamps[i]),
>                                                  new Path(workDir.getAbsolutePath()),
>                                                  false);
>          }
>          DistributedCache.setLocalFiles(conf, stringifyPathArray(p));
>
> I think the confusing part is that DistributedCache.getLocalCacheFiles() is
> paired with DistributedCache.setLocalFiles()
>
> Cheers
>
> On Thu, Apr 15, 2010 at 1:16 PM, Larry Compton
> <la...@gmail.com>wrote:
>
> > Ted,
> >
> > Thanks. I have looked at that example. The javadocs for DistributedCache
> > still refer to deprecated classes, like JobConf. I'm trying to use the
> > revised API.
> >
> > Larry
> >
> > On Thu, Apr 15, 2010 at 4:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Please see the sample within
> > > src\core\org\apache\hadoop\filecache\DistributedCache.java:
> > >
> > >  *     JobConf job = new JobConf();
> > >  *     DistributedCache.addCacheFile(new
> > > URI("/myapp/lookup.dat#lookup.dat"),
> > >  *                                   job);
> > >
> > >
> > > On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton
> > > <la...@gmail.com>wrote:
> > >
> > > > I'm trying to use the distributed cache in a MapReduce job written to
> > the
> > > > new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file
> > path
> > > is
> > > > added to the distributed cache as follows:
> > > >
> > > >    public int run(String[] args) throws Exception {
> > > >        Configuration conf = getConf();
> > > >        Job job = new Job(conf, "Job");
> > > >        ...
> > > >        DistributedCache.addCacheFile(new Path(args[0]).toUri(),
> conf);
> > > >        ...
> > > >        return job.waitForCompletion(true) ? 0 : 1;
> > > >    }
> > > >
> > > > The "setup()" method in my mapper tries to read the path as follows:
> > > >
> > > >    protected void setup(Context context) throws IOException {
> > > >        Path[] paths = DistributedCache.getLocalCacheFiles(context
> > > >                .getConfiguration());
> > > >    }
> > > >
> > > > But "paths" is null.
> > > >
> > > > I'm assuming I'm setting up the distributed cache incorrectly. I've
> > seen
> > > a
> > > > few hints in previous mailing list postings that indicate that the
> > > > distributed cache is accessed via the Job and JobContext objects in
> the
> > > > revised API, but the javadocs don't seem to support that.
> > > >
> > > > Thanks.
> > > > Larry
> > > >
> > >
> >
>
>

Re: Distributed Cache with New API

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
Hi,
@Ted, the code below is internal code. Users are not expected to call DistributedCache.getLocalCache(), nor can they, since they do not know all of the parameters.
@Larry, DistributedCache was not changed to use the new API in branch 0.20. The change was made only from branch 0.21 onward. See MAPREDUCE-898 ( https://issues.apache.org/jira/browse/MAPREDUCE-898).
If you are using branch 0.20, you are encouraged to use the deprecated JobConf itself.
You can try the following change in your code. Change the line
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
to
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
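For completeness, here is a rough sketch of a full driver with that change applied. The class name CacheJob is made up, and everything besides the addCacheFile line is ordinary Tool boilerplate, not Larry's actual code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CacheJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "Job");
        // ... set mapper class, input/output formats, paths ...

        // Register the cache file on the Job's own Configuration.
        // The Job constructor copies the Configuration it is given, so a
        // file added to the original conf afterwards never reaches the
        // submitted job -- which is why getLocalCacheFiles() returned null.
        DistributedCache.addCacheFile(new Path(args[0]).toUri(),
                                      job.getConfiguration());
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CacheJob(), args));
    }
}
```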

Thanks
Amareshwari

On 4/16/10 2:27 AM, "Ted Yu" <yu...@gmail.com> wrote:

Please take a look at the loop starting at line 158 in TaskRunner.java:
            p[i] = DistributedCache.getLocalCache(files[i], conf,
                                                  new Path(baseDir),
                                                  fileStatus,
                                                  false,
                                                  Long.parseLong(fileTimestamps[i]),
                                                  new Path(workDir.getAbsolutePath()),
                                                  false);
          }
          DistributedCache.setLocalFiles(conf, stringifyPathArray(p));

I think the confusing part is that DistributedCache.getLocalCacheFiles() is
paired with DistributedCache.setLocalFiles()

Cheers

On Thu, Apr 15, 2010 at 1:16 PM, Larry Compton <la...@gmail.com> wrote:

> Ted,
>
> Thanks. I have looked at that example. The javadocs for DistributedCache
> still refer to deprecated classes, like JobConf. I'm trying to use the
> revised API.
>
> Larry
>
> On Thu, Apr 15, 2010 at 4:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Please see the sample within
> > src\core\org\apache\hadoop\filecache\DistributedCache.java:
> >
> >  *     JobConf job = new JobConf();
> >  *     DistributedCache.addCacheFile(new
> > URI("/myapp/lookup.dat#lookup.dat"),
> >  *                                   job);
> >
> >
> > On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton
> > <la...@gmail.com>wrote:
> >
> > > I'm trying to use the distributed cache in a MapReduce job written to
> the
> > > new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file
> path
> > is
> > > added to the distributed cache as follows:
> > >
> > >    public int run(String[] args) throws Exception {
> > >        Configuration conf = getConf();
> > >        Job job = new Job(conf, "Job");
> > >        ...
> > >        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
> > >        ...
> > >        return job.waitForCompletion(true) ? 0 : 1;
> > >    }
> > >
> > > The "setup()" method in my mapper tries to read the path as follows:
> > >
> > >    protected void setup(Context context) throws IOException {
> > >        Path[] paths = DistributedCache.getLocalCacheFiles(context
> > >                .getConfiguration());
> > >    }
> > >
> > > But "paths" is null.
> > >
> > > I'm assuming I'm setting up the distributed cache incorrectly. I've
> seen
> > a
> > > few hints in previous mailing list postings that indicate that the
> > > distributed cache is accessed via the Job and JobContext objects in the
> > > revised API, but the javadocs don't seem to support that.
> > >
> > > Thanks.
> > > Larry
> > >
> >
>


Re: Distributed Cache with New API

Posted by Ted Yu <yu...@gmail.com>.
Please take a look at the loop starting at line 158 in TaskRunner.java:
            p[i] = DistributedCache.getLocalCache(files[i], conf,
                                                  new Path(baseDir),
                                                  fileStatus,
                                                  false,
                                                  Long.parseLong(fileTimestamps[i]),
                                                  new Path(workDir.getAbsolutePath()),
                                                  false);
          }
          DistributedCache.setLocalFiles(conf, stringifyPathArray(p));

I think the confusing part is that DistributedCache.getLocalCacheFiles() is
paired with DistributedCache.setLocalFiles().
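On the task side, that pairing means getLocalCacheFiles() hands back the localized copies the framework recorded with setLocalFiles(). A minimal setup() sketch (the class name and lookup-file handling are illustrative) that guards against the null Larry is seeing:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheReadingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException {
        // getLocalCacheFiles() returns the task-local copies recorded by
        // setLocalFiles(); it is null when no file was registered on the
        // Configuration the job was actually submitted with.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null || cached.length == 0) {
            throw new IOException("no files in the distributed cache");
        }
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            // ... load lookup data line by line ...
        } finally {
            reader.close();
        }
    }
}
```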

Cheers

On Thu, Apr 15, 2010 at 1:16 PM, Larry Compton <la...@gmail.com> wrote:

> Ted,
>
> Thanks. I have looked at that example. The javadocs for DistributedCache
> still refer to deprecated classes, like JobConf. I'm trying to use the
> revised API.
>
> Larry
>
> On Thu, Apr 15, 2010 at 4:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Please see the sample within
> > src\core\org\apache\hadoop\filecache\DistributedCache.java:
> >
> >  *     JobConf job = new JobConf();
> >  *     DistributedCache.addCacheFile(new
> > URI("/myapp/lookup.dat#lookup.dat"),
> >  *                                   job);
> >
> >
> > On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton
> > <la...@gmail.com>wrote:
> >
> > > I'm trying to use the distributed cache in a MapReduce job written to
> the
> > > new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file
> path
> > is
> > > added to the distributed cache as follows:
> > >
> > >    public int run(String[] args) throws Exception {
> > >        Configuration conf = getConf();
> > >        Job job = new Job(conf, "Job");
> > >        ...
> > >        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
> > >        ...
> > >        return job.waitForCompletion(true) ? 0 : 1;
> > >    }
> > >
> > > The "setup()" method in my mapper tries to read the path as follows:
> > >
> > >    protected void setup(Context context) throws IOException {
> > >        Path[] paths = DistributedCache.getLocalCacheFiles(context
> > >                .getConfiguration());
> > >    }
> > >
> > > But "paths" is null.
> > >
> > > I'm assuming I'm setting up the distributed cache incorrectly. I've
> seen
> > a
> > > few hints in previous mailing list postings that indicate that the
> > > distributed cache is accessed via the Job and JobContext objects in the
> > > revised API, but the javadocs don't seem to support that.
> > >
> > > Thanks.
> > > Larry
> > >
> >
>

Re: Distributed Cache with New API

Posted by Larry Compton <la...@gmail.com>.
Ted,

Thanks. I have looked at that example. The javadocs for DistributedCache
still refer to deprecated classes, like JobConf. I'm trying to use the
revised API.

Larry

On Thu, Apr 15, 2010 at 4:07 PM, Ted Yu <yu...@gmail.com> wrote:

> Please see the sample within
> src\core\org\apache\hadoop\filecache\DistributedCache.java:
>
>  *     JobConf job = new JobConf();
>  *     DistributedCache.addCacheFile(new
> URI("/myapp/lookup.dat#lookup.dat"),
>  *                                   job);
>
>
> On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton
> <la...@gmail.com>wrote:
>
> > I'm trying to use the distributed cache in a MapReduce job written to the
> > new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file path
> is
> > added to the distributed cache as follows:
> >
> >    public int run(String[] args) throws Exception {
> >        Configuration conf = getConf();
> >        Job job = new Job(conf, "Job");
> >        ...
> >        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
> >        ...
> >        return job.waitForCompletion(true) ? 0 : 1;
> >    }
> >
> > The "setup()" method in my mapper tries to read the path as follows:
> >
> >    protected void setup(Context context) throws IOException {
> >        Path[] paths = DistributedCache.getLocalCacheFiles(context
> >                .getConfiguration());
> >    }
> >
> > But "paths" is null.
> >
> > I'm assuming I'm setting up the distributed cache incorrectly. I've seen
> a
> > few hints in previous mailing list postings that indicate that the
> > distributed cache is accessed via the Job and JobContext objects in the
> > revised API, but the javadocs don't seem to support that.
> >
> > Thanks.
> > Larry
> >
>

Re: Distributed Cache with New API

Posted by hgahlot <hi...@gmail.com>.

hgahlot wrote:
> 
> I had the same problem, but Amareshwari's suggestion solved it. I am porting
> code from the 0.18.3 API to the 0.20.2 API. I am now facing problems with
> setting keys through the Configuration object. The value set during
> configuration using conf.setBoolean(<String name>, <boolean value>) is not
> retrieved in the mapper. I then ported the WordCount v2.0 example provided
> in the MapReduce tutorial and upgraded it to use the new API, but it has the
> same problem. It works fine with the 0.18.3 API but fails in the upgraded
> version. When I try to get the name of the input file using
> inputFile = conf.get("map.input.file");
> it prints null.
> Kindly let me know how to set the values of these user-defined keys in the
> new API.
> 
Using job.getConfiguration().set(...) instead of conf.set(...) solved it.
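The cause is the same as with the cache file: the Job constructor copies the Configuration, so values must be set on job.getConfiguration() to be visible in the tasks. A minimal sketch of the mapper side (the key name myapp.case.sensitive and class FlagMapper are made up):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlagMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private boolean caseSensitive;

    @Override
    protected void setup(Context context) {
        // Reads the flag the driver stored with
        //     job.getConfiguration().setBoolean("myapp.case.sensitive", true);
        // Setting it on the pre-Job conf would yield the default (false) here.
        caseSensitive = context.getConfiguration()
                               .getBoolean("myapp.case.sensitive", false);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = caseSensitive ? value.toString()
                                    : value.toString().toLowerCase();
        context.write(new Text(line), new LongWritable(1L));
    }
}
```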
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Distributed-Cache-with-New-API-tp722187p955402.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: Distributed Cache with New API

Posted by hgahlot <hi...@gmail.com>.
I had the same problem, but Amareshwari's suggestion solved it. I am porting
code from the 0.18.3 API to the 0.20.2 API. I am now facing problems with
setting keys through the Configuration object. The value set during
configuration using conf.setBoolean(<String name>, <boolean value>) is not
retrieved in the mapper. I then ported the WordCount v2.0 example provided in
the MapReduce tutorial and upgraded it to use the new API, but it has the same
problem. It works fine with the 0.18.3 API but fails in the upgraded version.
When I try to get the name of the input file using
inputFile = conf.get("map.input.file");
it prints null.
Kindly let me know how to set the values of these user-defined keys in the new
API.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Distributed-Cache-with-New-API-tp722187p952861.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: Distributed Cache with New API

Posted by Ted Yu <yu...@gmail.com>.
Please see the sample within
src\core\org\apache\hadoop\filecache\DistributedCache.java:

 *     JobConf job = new JobConf();
 *     DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
 *                                   job);


On Thu, Apr 15, 2010 at 12:56 PM, Larry Compton <la...@gmail.com> wrote:

> I'm trying to use the distributed cache in a MapReduce job written to the
> new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file path is
> added to the distributed cache as follows:
>
>    public int run(String[] args) throws Exception {
>        Configuration conf = getConf();
>        Job job = new Job(conf, "Job");
>        ...
>        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
>        ...
>        return job.waitForCompletion(true) ? 0 : 1;
>    }
>
> The "setup()" method in my mapper tries to read the path as follows:
>
>    protected void setup(Context context) throws IOException {
>        Path[] paths = DistributedCache.getLocalCacheFiles(context
>                .getConfiguration());
>    }
>
> But "paths" is null.
>
> I'm assuming I'm setting up the distributed cache incorrectly. I've seen a
> few hints in previous mailing list postings that indicate that the
> distributed cache is accessed via the Job and JobContext objects in the
> revised API, but the javadocs don't seem to support that.
>
> Thanks.
> Larry
>