Posted to common-user@hadoop.apache.org by michaelz <wb...@hotmail.com> on 2011/02/16 17:36:19 UTC

How to use DistributedCache to load data generated from a previous MapReduce job?

I have a MapReduce job #1, which processes input files and produces <key,
value> pairs. These pairs are stored as SequenceFiles under 100
directories:
001/
002/
.....
100/

Now I have another MapReduce job #2, and I would like to load the data from
the 100 directories into the DistributedCache so that job #2 can access it
quickly. Is there a good way to do that?

My first thought was to call DistributedCache.addCacheFile(...) and
DistributedCache.getLocalCacheFiles(...) 100 times, but that does not seem
like the right approach for my problem. Any other solutions?

-- 
View this message in context: http://old.nabble.com/How-to-use-DistributedCache-to-load-data-generated-from-a-previous-MapReduce-job--tp30936650p30936650.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


RE: How to use DistributedCache to load data generated from a previous MapReduce job?

Posted by pr...@nokia.com.
Is the input for job #2 the output of job #1? If so, you don't need the
DistributedCache to share data between the jobs. All you need to do is give
the output path of job #1 as the input path of job #2 in the job
configuration, since every MapReduce job has access to HDFS. You can also
set the input format class on the job to SequenceFileInputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Read job1's output (SequenceFiles) directly as job2's input.
Path in = new Path("output/path/of/job1");
FileInputFormat.addInputPath(job, in);
job.setInputFormatClass(SequenceFileInputFormat.class);
..
..
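
One more thing to watch for: since the data sits under 100 subdirectories
(001/ through 100/), FileInputFormat will not descend into them on its own.
Either add each directory as an input path, or (if I remember right, globs
are expanded when input paths are listed) use a glob:

// Pick up all 100 subdirectories in one call.
FileInputFormat.addInputPath(job, new Path("output/path/of/job1/*"));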

Hope this helps
Praveen
